This dataset collects information from 100k medical appointments in Brazil. A number of characteristics about the patient are included in each row.
'ScheduledDay' tells us on what day the patient set up their appointment. 'Neighbourhood' indicates the location of the hospital. 'Scholarship' indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família. 'No_show' indicates whether the patient showed up for their appointment. Note the encoding of this column: it says 'No' if the patient showed up for their appointment, and 'Yes' if they did not show up.
Q1: Does the neighbourhood of a hospital affect whether people turn up for their appointment?
Q2: Does being handicapped affect whether people show up for their appointment?
Q3: Do people who receive an SMS show up for their appointment?
Q4: Does age affect whether people turn up for their appointment?
#importing all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')
# Upgrade pandas to use dataframe.explode() function.
!pip install --upgrade pandas==0.25.0
# loading the dataset into the notebook and then exploring the dataset's properties
df = pd.read_csv('Database_No_show_appointments/noshowappointments-kagglev2-may-2016.csv')
df.head()
Checking the shape of the dataset to know the number of rows and columns
df.shape
Knowing the general description of the dataset
df.describe()
Finding the general information about the dataset
df.info()
To check whether there are any duplicates.
sum(df.duplicated())
To get a general graph of all the columns in the dataset
df.hist(figsize=(10,8));
After careful observation of the properties of the data, some columns will be dropped, the No-show column will be renamed, and the values of the No-show column will be recoded.
#Removing the PatientId, AppointmentID columns from the general dataset
data = ['PatientId','AppointmentID']
df.drop(data,axis=1,inplace= True)
df.tail(20)
#renaming the No-show column and correcting the misspelled column names
df = df.rename(columns = {'No-show':'No_show', 'Hipertension':'Hypertension', 'Handcap':'Handicap'})
df.head()
#changing the values of 'Yes' and 'No' to 1 and 0
#assigning the result back avoids relying on an in-place change to a column accessed as an attribute
df['No_show'] = df['No_show'].map({'Yes': 1, 'No': 0})
df.head()
df.query('Neighbourhood== "ILHAS OCEÂNICAS DE TRINDADE"')
df.drop([48754,48765],axis=0,inplace=True)
These rows were dropped because, during my exploration of Neighbourhood, they gave me a value error: could not convert str to float: "ILHAS OCEÂNICAS DE TRINDADE". I did not know how to fix that, so I dropped the rows. The same code ran on my local machine without any errors, so dropping them seemed like the best option.
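The rows above are removed by their hard-coded index labels, which only works if the file is loaded in exactly the same order. A minimal sketch of an alternative is to filter on the neighbourhood name itself. The `toy` frame below is invented for illustration and stands in for `df`; only the neighbourhood name comes from the notebook.

```python
import pandas as pd

# Invented rows for illustration only; they stand in for the real df.
toy = pd.DataFrame({
    'Neighbourhood': ['JARDIM CAMBURI', 'ILHAS OCEÂNICAS DE TRINDADE',
                      'CENTRO', 'ILHAS OCEÂNICAS DE TRINDADE'],
    'No_show': [0, 1, 0, 1],
})

# Boolean mask that is True for the rows we want to remove.
mask = toy['Neighbourhood'] == 'ILHAS OCEÂNICAS DE TRINDADE'

# Keep everything else, so no row labels need to be hard-coded.
toy_clean = toy[~mask].reset_index(drop=True)
print(toy_clean)
```

On the real data the same two lines would be applied to `df` instead of `toy`.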
# NB: the ones in the No_show column indicate
#that the patient did not show up for their appointment
df.groupby('Neighbourhood').No_show.mean()
df.groupby('Neighbourhood').No_show.mean().plot(kind='bar',figsize=(20,8))
plt.title('Proportion of missed appointments by hospital neighbourhood',fontsize=20)
plt.ylabel('Proportion of no-shows', fontsize=20)
plt.xlabel('Neighbourhood',fontsize=20)
plt.legend();
The graph above shows, for each hospital neighbourhood, the proportion of appointments that were missed (the mean of No_show, where 1 means the patient did not show up).
#Breaking down the general overview of the graph to get a deeper insight
#into the neighbourhoods where people turned up or did not turn up for their appointment
disappointed = df.No_show == True
showed = df.No_show == False
#defining hist_chart function
def hist_chart(arg1,arg2,arg3,arg4,arg5):
    arg1.hist(alpha = 0.5, bins=20, label = 'showed',figsize =(20,8))
    arg2.hist(alpha = 0.5, bins=20, label = 'disappointed',figsize =(20,8))
    plt.title(arg3,fontsize=20)
    plt.ylabel(arg4,fontsize=20)
    plt.xlabel(arg5,fontsize=20)
    plt.legend();
#using the hist_chart function
hist_chart(df.Neighbourhood[showed],df.Neighbourhood[disappointed],'A histogram of the people that showed and did not show, by hospital location.','Frequency','Neighbourhood')
This graph goes a step further and separates, for each hospital neighbourhood, the people who showed up for their appointment from those who did not.
Limitations: I don't know of a better way to separate the names for a proper visualization.
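One possible way around the crowded axis is to count attended and missed appointments per neighbourhood, keep only the busiest neighbourhoods, and draw horizontal bars so the long names have room. This is only a sketch: the `toy` rows are invented, and `top_n` is a hypothetical cut-off that would need tuning on the real data.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'B', 'B', 'C'],
    'No_show':       [0,   0,   1,   0,   1,   0],
})

# Count attended (0) and missed (1) appointments per neighbourhood.
counts = pd.crosstab(toy['Neighbourhood'], toy['No_show'])
counts.columns = ['showed', 'missed']

# Keep only the busiest neighbourhoods so the labels stay readable.
top_n = 2  # hypothetical cut-off; a larger value may suit the full data
top = counts.loc[counts.sum(axis=1).nlargest(top_n).index]

# Horizontal bars leave room for long neighbourhood names.
ax = top.plot(kind='barh', stacked=True, figsize=(8, 4))
ax.set_xlabel('Number of appointments')
ax.set_ylabel('Neighbourhood')
ax.set_title('Attended and missed appointments, busiest neighbourhoods')
```

The table `top` can also be printed on its own when the exact counts matter more than the picture.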
#creating a pie chart to show the hospitals where people showed up for their appointment.
#defining a pie_chart function
def pie_chart(arg1,arg2,arg3):
    arg1.value_counts().plot(kind = 'pie',figsize = (120,120),fontsize=80)
    plt.title(arg2,fontsize=100)
    plt.xlabel(arg3,fontsize=100)
    plt.legend();
#using the pie_chart function
pie_chart(df.Neighbourhood[showed],'Plotting a pie chart to know the hospital that had the largest number of turn-up','Neighbourhood')
From the pie chart I can now identify the hospital neighbourhood where the most people showed up for their appointment: Jardim Camburi had the largest number of people that showed.
Limitations: I could not find a better way to keep the names from clustering together.
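A pie chart with one slice per neighbourhood gets crowded quickly. A rough sketch of one workaround is to keep the largest few categories and fold the rest into a single "Other" slice before plotting. The values below are invented, and `keep` is a hypothetical setting.

```python
import pandas as pd

# Invented category labels for illustration only.
attended = pd.Series(['A'] * 5 + ['B'] * 3 + ['C'] + ['D'])

# value_counts() sorts from most to least frequent.
counts = attended.value_counts()

keep = 2  # hypothetical number of slices to show individually
summary = counts.head(keep).copy()

# Everything beyond the first `keep` categories becomes one slice.
summary['Other'] = counts.iloc[keep:].sum()
print(summary)
```

The resulting `summary` Series could then be passed to `.plot(kind='pie')` in place of the full `value_counts()` result.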
#investigating the properties that made people come for their appointment in Jardim Camburi Location.
df_m = df.query('Neighbourhood == "JARDIM CAMBURI"')
#numeric_only=True keeps the text and date columns out of the mean
df_m.mean(numeric_only=True).plot(kind='bar')
plt.title('Showing the general properties of the people that had an appointment at the JARDIM CAMBURI hospital',fontsize=15)
plt.ylabel('Mean value',fontsize=20)
plt.xlabel('JARDIM CAMBURI HOSPITAL')
plt.legend();
From the graph, Age and SMS_received appear to be the two factors related to people turning up in this neighbourhood: the patients are mostly within the 0-45 age range, and there is a slight relationship with having received an SMS.
#investigating the factors that made the neighbourhood show up for their appointment
#Looking at being handicapped
#using the hist_chart function
hist_chart(df.Handicap[showed],df.Handicap[disappointed],'Plotting the graph of Handicap people that showed and did not show up for their appointment','Frequency','Handicap')
The graph above shows that, among handicapped people, more showed up for their appointment than missed it.
#Trying to understand the ratio of Handicap people that showed up for their appointment
print(df.Handicap[showed].value_counts())
#defining pie_chart1 function
def pie_chart1(arg1,arg2,arg3):
    arg1.value_counts().plot(kind = 'pie',figsize = (10,10),fontsize=10)
    plt.title(arg2,fontsize=15)
    plt.xlabel(arg3,fontsize=20)
    plt.show();
pie_chart1(df.Handicap[showed],'A piechart of the number of people that showed for their appointment','Handicap[showed]')
From the above chart, it can be seen that more people who were not handicapped showed up for their appointment than people who were handicapped.
#This is to check for the ratio of the handicap people that did not show up for their appointment
#using the pie_chart1 function
pie_chart1(df.Handicap[disappointed],'A piechart that shows the number of handicap people that did not show.','Handicap[disappointed]')
Based on these findings, this chart resembles the first chart of the handicapped people that showed up for their appointment. Therefore, there is not much of a relationship between being handicapped and showing up for an appointment, because people who were not handicapped also missed their appointments.
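The two pie charts compare raw counts, which mostly reflect how many patients fall in each Handicap level. One way to put the groups on the same scale is to compute the no-show rate within each level. Since No_show is coded 0/1, the group mean is that rate. The rows below are invented for illustration and say nothing about the real data.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    'Handicap': [0, 0, 0, 0, 1, 1],
    'No_show':  [0, 0, 0, 1, 0, 1],
})

# Mean of a 0/1 column = share of ones, i.e. the no-show rate per group.
# 'size' is kept alongside so small groups are easy to spot.
rate = toy.groupby('Handicap')['No_show'].agg(['mean', 'size'])
rate.columns = ['no_show_rate', 'appointments']
print(rate)
```

Keeping the group size next to the rate helps judge how much weight a rate from a small group deserves.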
#Plotting a general histogram of people who received SMS and showed with people who did not show
#using the hist_chart function defined to plot the graph
hist_chart(df.SMS_received[showed],df.SMS_received[disappointed],'Graph of people who received SMS that showed and did not show for their appointment','Frequency','SMS_received')
This shows that, among people who received an SMS, more showed up for their appointment than did not. But further investigation is needed to understand it better.
print(df.SMS_received[showed].value_counts())
#using the pie_chart1 function
pie_chart1(df.SMS_received[showed],'A piechart indicating people who received SMS that showed for their appointment','SMS_received[showed]')
From the above chart, more of the people who showed up for their appointment had not received an SMS than had received one. Still, it shows that people who receive the SMS do show up.
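Because far fewer or far more people may sit in one SMS group than the other, counts alone cannot say which group attends more reliably. A sketch of a row-normalised cross-tabulation, which gives the share of shows and no-shows within each SMS group, is below. The rows are invented for illustration only.

```python
import pandas as pd

# Invented rows for illustration only; they prove nothing about the real data.
toy = pd.DataFrame({
    'SMS_received': [0, 0, 0, 0, 1, 1, 1, 1],
    'No_show':      [0, 0, 0, 1, 0, 0, 1, 1],
})

# normalize='index' divides each row by its own total,
# so every row sums to 1 and the groups become comparable.
shares = pd.crosstab(toy['SMS_received'], toy['No_show'], normalize='index')
print(shares)
```

In the result, the column labelled 1 is the no-show share for each SMS group.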
#investigating how the age of people affected their turn-up for their appointment
#using the hist_chart function
hist_chart(df.Age[showed],df.Age[disappointed],'Graph of the Age of people that showed and did not show for their appointment','Frequency','Age')
The graph shows that as age increases, fewer people show up for their appointment. As you can see, Age 0 has the largest turn-up, perhaps because these patients were newborn babies.
#investigating with a pie chart to really tell us more about which age showed up for their appointment
#using the pie_chart1 function
pie_chart1(df.Age[showed],'A piechart of Age that showed for their appointment','Age[showed]')
This still tells us that patients of Age 0 were at the hospital for their appointment.
Limitations: I could not find a better way to keep the values from clustering together.
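With one slice per individual age, the chart is hard to read. A possible alternative is to group ages into a few bands with `pd.cut` and compare the no-show rate per band. The band edges and labels below are hypothetical choices, and the rows are invented for illustration.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    'Age':     [0, 3, 17, 25, 40, 62, 70, 88],
    'No_show': [0, 0, 1,  1,  0,  0,  0,  1],
})

# Hypothetical age bands; right=False makes each band include its lower edge.
bins = [0, 18, 40, 65, 120]
labels = ['0-17', '18-39', '40-64', '65+']
toy['age_group'] = pd.cut(toy['Age'], bins=bins, labels=labels, right=False)

# Mean of the 0/1 No_show column = no-show rate within each band.
rate = toy.groupby('age_group', observed=False)['No_show'].mean()
print(rate)
```

Any age outside the chosen edges is left without a band, so the edges should be checked against the range of Age in the real data.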
1.) I found that the Neighbourhood with the largest number of turn-ups was influenced by age and SMS received. The limitation that I observed was that there was no information about the patients' addresses. There could be people living very far from the hospital neighbourhood, which kept them from coming.
2.) I found that being handicapped did not noticeably affect whether people came to the hospital.
3.) I found that people who received an SMS tend to show up for their appointment.
4.) I found that younger people tend to show up for their appointment more than the elderly. The limitation I saw was that not much information was given to explain why the older folks, who are supposed to be the ones having regular checkups and not missing their appointments, are the ones that did not come.
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])